9/14/2020

Agenda

  • Simple Linear regression
  • Multiple Linear regression
  • Qualitative predictors (dummy variables)
  • Extensions: interactions, nonlinear effects
  • Potential issues: outliers and assumption verification (residual plots and transformations)

Recap

Multiple linear regression

We can write down \(f(\mathbf{X})\) in the form of an equation: \[y = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p + \varepsilon\]

We interpret \(\beta_j\) as the average effect on \(Y\) of a one-unit increase in \(X_j\), holding all other predictors fixed. But predictors usually change together!

Assumption: \(\varepsilon_1, \dots, \varepsilon_n \stackrel{iid}{\sim} \mathcal{N}(0, \sigma^2)\)

Multiple linear regression

t-test for individual coefficients

  • \(H_0: \beta_j = 0\)
  • \(H_a: \beta_j \neq 0\)

Intervals and testing via \(\hat{\beta}_j\) and \(\text{SE}(\hat{\beta}_j)\) are one-at-a-time procedures.

F-test for significance of the model

  • \(H_0: \beta_1 = \dots = \beta_p = 0\)
  • \(H_a: \exists \ j \text{ s.t. } \beta_j \neq 0\)

We can use the F-statistic:

\[F = \dfrac{(\text{TSS} - \text{RSS})/p}{\text{RSS}/(n-p-1)} \sim F_{p,n-p-1}\]

What to check in R

## 
## Call:
## lm(formula = Y ~ X1 + X2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3651 -3.3037 -0.6222  3.1068 10.3991 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 12.05424    1.87387   6.433 4.74e-09 ***
## X1           0.46707    0.26217   1.782    0.078 .  
## X2          -0.97619    0.09899  -9.861 2.68e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.756 on 97 degrees of freedom
## Multiple R-squared:  0.5136, Adjusted R-squared:  0.5035 
## F-statistic:  51.2 on 2 and 97 DF,  p-value: 6.625e-16
  • The MLE is \(\hat{\beta}_0 = 12.05, \hat{\beta}_1 = 0.47, \hat{\beta}_2 = -0.98\)
  • Given \(x_1^{\star} = 10, x_2^{\star} = -5\), \(\begin{aligned}\hat{y}^{\star} &= \hat{\beta}_0 + \hat{\beta}_1 x^{\star}_1 + \hat{\beta}_2 x^{\star}_2 \\ &= 12.05 + 4.7 + 4.9 = 21.65\end{aligned}\)
  • \(X_{1}\) is only weakly significant: maybe we can remove it?
  • \(s = \sqrt{\text{RSS} / (n-p-1)} = 4.756\)
  • \(R^{2} = 0.51\)
  • The F-test is significant: at least one coefficient is nonzero (the computations on this slide are sketched in R below)
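A minimal R sketch of these computations, assuming Y, X1 and X2 are the vectors used in the call above:

fit <- lm(Y ~ X1 + X2)
summary(fit)                 # the output shown above

coef(fit)                    # estimates of beta_0, beta_1, beta_2
sigma(fit)                   # s = sqrt(RSS / (n - p - 1)) = 4.756
summary(fit)$r.squared       # R^2 = 0.51

new_obs <- data.frame(X1 = 10, X2 = -5)
predict(fit, newdata = new_obs)                            # point prediction y_hat* = 21.65
predict(fit, newdata = new_obs, interval = "prediction")   # interval for a new observation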

Qualitative predictors

Predictors with only two levels

Example: investigate the difference in credit card balance between males and females, ignoring the other variables.

We replace the variable gender with the following “dummy” variable: \[x_i = \begin{cases} 0 & \text{if $i^{th}$ person is male} \\ 1 & \text{if $i^{th}$ person is female} \end{cases}\]

Resulting model: \[y_i = \beta_0 + \beta_1 x_i + \varepsilon_i = \begin{cases} \beta_0 + \varepsilon_i & \text{if $i^{th}$ person is male} \\ (\beta_0 + \beta_1) + \varepsilon_i & \text{if $i^{th}$ person is female} \end{cases}\]

Interpretation?
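A minimal sketch of this fit in R (the variable names balance and gender are illustrative, following the credit card example above):

gender <- factor(gender, levels = c("Male", "Female"))   # "Male" is the baseline level
fit <- lm(balance ~ gender)
summary(fit)
# (Intercept)  estimates beta_0, the average balance for males
# genderFemale estimates beta_1, the average female-minus-male difference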

Predictors with more than two levels

We want to evaluate the difference in house prices in a few different neighborhoods.

##       Size       Neigh    Price
## 1 1.062684   Riverside 109.5524
## 2 2.461323   Hyde Park 170.7928
## 3 2.425506   Hyde Park 176.5540
## 4 2.464293   Riverside 208.7832
## 5 3.117517   Hyde Park 202.9831
## 6 2.510854 West Campus 175.3222

Predictors with more than two levels

Let’s create the dummy variables “Neigh West Campus” and “Neigh Hyde Park”.

##       Size       Neigh    Price
## 1 1.062684   Riverside 109.5524
## 2 2.461323   Hyde Park 170.7928
## 3 2.425506   Hyde Park 176.5540
## 4 2.464293   Riverside 208.7832
## 5 3.117517   Hyde Park 202.9831
## 6 2.510854 West Campus 175.3222
##   Intercept     Size Neigh West Campus Neigh Hyde Park
## 1         1 1.062684                 0               0
## 2         1 2.461323                 0               1
## 3         1 2.425506                 0               1
## 4         1 2.464293                 0               0
## 5         1 3.117517                 0               1
## 6         1 2.510854                 1               0

With more than two levels, we create additional dummy variables. For example, for the neighborhood variable we create two dummy variables:

\[\begin{aligned} &x_{i1} = \begin{cases} 1 & \text{if $i^{th}$ house is in West Campus} \\ 0 & \text{if $i^{th}$ house is not in West Campus} \end{cases} \\ &x_{i2} = \begin{cases} 1 & \text{if $i^{th}$ house is in Hyde Park} \\ 0 & \text{if $i^{th}$ house is not in Hyde Park} \end{cases} \end{aligned}\]

Predictors with more than two levels

Then both of these variables can be used in the regression equation, in order to obtain the model:

\[\begin{aligned} y_i &= \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 \text{size} + \varepsilon \\ &\ \\ &= \begin{cases} \beta_0 + \beta_3 \text{size} + \varepsilon & \text{if $i^{th}$ house is in Riverside} \\ (\beta_0 + \beta_1) + \beta_3 \text{size} + \varepsilon & \text{if $i^{th}$ house is in West Campus} \\ (\beta_0 + \beta_2) + \beta_3 \text{size} + \varepsilon & \text{if $i^{th}$ house is in Hyde Park} \end{cases} \end{aligned}\]

General rule: one fewer dummy variable than the number of levels. The level with no dummy variable - “Riverside” in this example - is known as the baseline.
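In R, declaring Neigh as a factor produces these dummies automatically; a minimal sketch, assuming the data frame above is called houses:

houses$Neigh <- relevel(factor(houses$Neigh), ref = "Riverside")   # make Riverside the baseline
fit <- lm(Price ~ Size + Neigh, data = houses)
head(model.matrix(fit))   # intercept, Size, and one dummy per non-baseline level
summary(fit)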

Predictors with more than two levels

\[\begin{aligned} y_i &= \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 \text{size} + \varepsilon \\ &\ \\ &= \begin{cases} \beta_0 + \beta_3 \text{size} + \varepsilon & \text{if $i^{th}$ house is in Riverside} \\ (\beta_0 + \beta_1) + \beta_3 \text{size} + \varepsilon & \text{if $i^{th}$ house is in West Campus} \\ (\beta_0 + \beta_2) + \beta_3 \text{size} + \varepsilon & \text{if $i^{th}$ house is in Hyde Park} \end{cases} \end{aligned}\]

How to test for a neighbourhood effect? Test the joint hypothesis with an F-test (comparing the models with and without the two dummies):

\[H_0: \beta_1 = \beta_2 = 0\]

Confidence interval for the intercept parameter of a house in Hyde Park? It is a linear combination of the parameters!

\[\beta_0 + \beta_2\]
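A minimal sketch of this interval in R, reusing the fit object above and its estimated coefficient covariance matrix:

b <- coef(fit); V <- vcov(fit)
a <- rep(0, length(b)); names(a) <- names(b)
a["(Intercept)"] <- 1
a[grep("Hyde", names(b))] <- 1                       # weight 1 on the Hyde Park dummy
est <- sum(a * b)                                    # estimate of beta_0 + beta_2
se  <- sqrt(drop(t(a) %*% V %*% a))                  # its standard error
est + c(-1, 1) * qt(0.975, df.residual(fit)) * se    # 95% confidence interval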

Extensions

Adding interaction terms

In our previous analysis of the Advertising data, we assumed that the effect on sales of increasing spending on one advertising medium is independent of the amount spent on the other media.

For example, the linear model \[\text{Sales} = \beta_0 + \beta_1 \text{TV} + \beta_2 \text{Radio} + \varepsilon\] states that the average effect on sales of a one-unit increase in TV is always \(\beta_1\), regardless of the amount spent on radio.

Say that \(\hat{\beta}_{1} = 0.4, \hat{\beta}_{2} = 0.2\). Given a fixed budget of \(\$100,000\), how would you allocate this budget to maximize sales?

Adding interaction terms

  • Now, suppose that spending money on radio advertising actually increases the effectiveness of TV advertising, so that the slope term for TV should increase as radio increases
  • In this situation, given a fixed budget of \(\$100,000\), spending half on radio and half on TV may increase sales more than allocating the entire amount to either TV or to radio
  • In marketing, this is known as a synergy effect, and in statistics it is referred to as an interaction effect

The model takes the form

\[\text{Sales} = \beta_0 + \beta_1 \text{TV} + \beta_2 \text{Radio} + \beta_3 \text{TV} \times \text{Radio} + \varepsilon\]

If the p-value for the interaction term \(\text{TV} \times \text{Radio}\) is small, then there is strong evidence for \(\beta_3 \neq 0\).
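In R the interaction is specified with the * (or :) formula operator; a sketch, assuming an Advertising data frame with columns Sales, TV and Radio:

fit_int <- lm(Sales ~ TV * Radio, data = Advertising)   # expands to TV + Radio + TV:Radio
summary(fit_int)                                        # check the p-value on the TV:Radio row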

Adding interaction terms

\[\text{Sales} = \beta_0 + \beta_1 \text{TV} + \beta_2 \text{Radio} + \beta_3 \text{TV} \times \text{Radio} + \varepsilon\]

Say that \(\hat{\beta}_0 = 6.750, \hat{\beta}_1 = 0.019, \hat{\beta}_2 = 0.029, \hat{\beta}_3 = 0.0011\)

  • An increase in TV advertising of \(\$1,000\) is associated with an increase in sales of \((\hat{\beta}_1 + \hat{\beta}_3 \times \text{Radio}) \times 1000 = 19 + 1.1 \times \text{Radio}\) units
  • An increase in radio advertising of \(\$1,000\) is associated with an increase in sales of \((\hat{\beta}_2 + \hat{\beta}_3 \times \text{TV}) \times 1000 = 29 + 1.1 \times \text{TV}\) units

Hierarchy principle

  • Sometimes it is the case that an interaction term has a very small p-value, but the associated main effects (in this case, TV and radio) do not
  • The hierarchy principle: if we include an interaction in a model, we should also include the main effects, even if the p-values associated with their coefficients are not significant
  • The rationale for this principle is that interactions are hard to interpret in a model without main effects - their meaning is changed
  • Specifically, if the model has no main-effect terms, the interaction term absorbs the main effects, so its coefficient no longer measures a pure interaction

Interactions between qualitative and quantitative variables

Back to the Austin house data (simplified).

Interactions between qualitative and quantitative variables

The previous version of the model was

\[\begin{aligned} y_i &= \beta_0 + \beta_1 \text{HP} + \beta_2 \text{Size} + \varepsilon \\ &\ \\ &= \begin{cases} \beta_0 + \beta_2 \text{Size} + \varepsilon & \text{if $i^{th}$ house is not in Hyde Park} \\ (\beta_0 + \beta_1) + \beta_2 \text{Size} + \varepsilon & \text{if $i^{th}$ house is in Hyde Park} \end{cases} \end{aligned}\]

Interactions between qualitative and quantitative variables

Adding an interaction between neighbourhood and size leads to

\[\begin{aligned} y_i &= \beta_0 + \beta_1 \text{HP} + \beta_2 \text{Size} + \beta_3 \text{HP} \times \text{Size} + \varepsilon \\ &\ \\ &= \begin{cases} \beta_0 + \beta_2 \text{Size} + \varepsilon & \text{if $i^{th}$ house is not in Hyde Park} \\ (\beta_0 + \beta_1) + (\beta_2 + \beta_3) \text{Size} + \varepsilon & \text{if $i^{th}$ house is in Hyde Park} \end{cases} \end{aligned}\]
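A sketch of this fit in R, assuming a 0/1 indicator (or two-level factor) HP for Hyde Park in the houses data frame:

fit_hp <- lm(Price ~ HP * Size, data = houses)
coef(fit_hp)
# (Intercept) = beta_0 : intercept outside Hyde Park
# HP          = beta_1 : shift in intercept for Hyde Park
# Size        = beta_2 : slope outside Hyde Park
# HP:Size     = beta_3 : change in slope for Hyde Park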

Interactions between qualitative and quantitative variables

Adding non linear terms

A simple approach for incorporating non-linear associations in a linear model is to include transformed versions of the predictors in the model, e.g.

\[\text{mpg} = \beta_0 + \beta_1 \text{horsepower} + \beta_2 \text{horsepower}^2 + \varepsilon\]

We are predicting mpg using a non-linear function of horsepower, but it is still a linear model with \(X_1 = \text{horsepower}\) and \(X_2 = \text{horsepower}^2\).

  • We can use standard linear regression software to estimate the parameters (see the sketch below)
  • Polynomial regression: extending the linear model to accommodate non-linear relationships
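A sketch of this fit in R, assuming the Auto data from the ISLR package:

library(ISLR)   # provides the Auto data (mpg, horsepower, ...)
fit_quad <- lm(mpg ~ horsepower + I(horsepower^2), data = Auto)   # I() protects the square in the formula
summary(fit_quad)
# poly(horsepower, 2) fits the same curve using orthogonal polynomial terms
fit_poly <- lm(mpg ~ poly(horsepower, 2), data = Auto)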

Adding non linear terms

Other models

We can also use the linear model machinery to fit an exponential curve with multiplicative error \[Y = a b^{X} \varepsilon.\]
In fact, taking the log, we get \[\log (Y) = \underbrace{\log(a)}_{\beta_{0}} + \underbrace{\log(b)}_{\beta_{1}} X + \log(\varepsilon).\] So it corresponds to fitting the linear model to the transformed data \[\{(x_{1}, \log(y_{1})), \dots, (x_{n}, \log(y_{n}))\}\]
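A sketch of the corresponding fit in R, for generic vectors x and y with positive y:

fit_exp <- lm(log(y) ~ x)            # fit on the log scale
a_hat <- exp(coef(fit_exp)[1])       # a = exp(beta_0)
b_hat <- exp(coef(fit_exp)[2])       # b = exp(beta_1)
plot(x, y)
curve(a_hat * b_hat^x, add = TRUE)   # fitted exponential curve on the original scale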

Potential issues

Non-linearity of the data

  • Residual plots are a useful graphical tool for identifying non-linearity. See residual plots of Auto data below
  • A simple approach is to use non-linear transformations of the predictors, such as \(\log(X)\), \(\sqrt{X}\), and \(X^2\)
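A sketch of the residual plot in R (the Auto data fit is an assumption, matching the example referenced above):

fit <- lm(mpg ~ horsepower, data = Auto)
plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")   # curvature suggests a missing non-linear term
abline(h = 0, lty = 2)
# plot(fit, which = 1) draws the same diagnostic with a smoother added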

Correlation of error terms

An important assumption of the linear regression model is that the error terms, \(\varepsilon_1, \varepsilon_2, \dots, \varepsilon_n\) are uncorrelated.

  • Correlated errors lead us to underestimate uncertainty (standard errors are too small, confidence intervals too narrow, …)
  • Correlations frequently occur in the context of time series data, identified by seeing tracking in the residuals
  • Good experimental design can help mitigate the risk of such correlations

Correlation of error terms

If the errors are uncorrelated, then the fact that \(\varepsilon_i\) is positive provides little or no information about the sign of \(\varepsilon_{i+1}\).

Non-constant variance of error terms

Another important assumption of the linear regression model is that the error terms have a constant variance, \(\text{Var}(\varepsilon_i) = \sigma^2\).

  • One can identify non-constant variance in the errors (heteroscedasticity) from the presence of a funnel shape in the residual plot
  • One possible solution is to transform the response \(Y\) using a concave function such as \(\log(Y)\) or \(\sqrt{Y}\)

Outliers

  • An outlier is a point for which \(y_i\) is far from the value predicted by the model
  • Outliers can be identified by plotting the studentized residuals (see the sketch below)
  • Outliers are often removed even when they have little effect on the least squares line, because they inflate the estimate of the RSE (and hence the standard errors and p-values)
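A sketch in R, for a generic fitted lm object fit:

rs <- rstudent(fit)                            # studentized residuals
plot(fitted(fit), rs, ylab = "Studentized residuals")
abline(h = c(-3, 3), lty = 2)                  # values beyond about |3| are suspect
which(abs(rs) > 3)                             # candidate outliers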

High leverage points

  • Leverage points have unusual values for \(x_i\), and have substantial impact on the least squares line
  • Can be identified by calculating the leverage statistic \(h_i\) and comparing it to its average value \((p+1)/n\) (see the sketch below)

\[h_i = \frac{1}{n} + \frac{(x_i - \overline{x})^2}{\sum_{j=1}^n (x_j - \overline{x})^2}\]
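In R the leverages of a fitted lm object are returned by hatvalues(); a minimal sketch:

h <- hatvalues(fit)                            # leverage statistic h_i for each observation
avg_lev <- length(coef(fit)) / length(h)       # average leverage (p + 1) / n
which(h > 2 * avg_lev)                         # flag points well above the average (a common rule of thumb)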

Collinearity

  • Collinearity refers to the situation when two or more predictors are highly correlated
  • The presence of collinearity makes it difficult to separate out individual effects of collinear predictors on the response, and also inflates the estimated SE of coefficients
  • Pairwise collinearity can be identified from a pairwise scatterplot or the correlation matrix of the predictors. Multicollinearity can be identified by a large variance inflation factor (a common rule of thumb: VIF exceeding 5 or 10; see the sketch below) \[\text{VIF} (\hat{\beta}_j) = \dfrac{1}{1 - R^2_{X_j \mid X_{-j}}}\]
  • Possible solutions: drop one of the collinear predictors, or combine them into a single predictor
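A sketch in R: compute the VIF directly from its definition, or use the vif() function from the car package (the variable names X1, X2, X3 and the data frame dat are placeholders):

# VIF for X1: regress X1 on the other predictors and plug R^2 into the formula
r2 <- summary(lm(X1 ~ X2 + X3, data = dat))$r.squared
vif_X1 <- 1 / (1 - r2)

# equivalently, for all predictors at once (requires the car package):
# library(car); vif(lm(Y ~ X1 + X2 + X3, data = dat))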

Summary

When building a regression model remember that simplicity is your friend. Smaller models are easier to interpret and have fewer unknown parameters to be estimated.

Keep in mind that every additional parameter represents a cost!

The first step of every model building exercise is the selection of the universe of variables to be potentially used. This task relies on your experience and context-specific knowledge:

  • Think carefully about the problem
  • Consult subject matter research and experts
  • Avoid the mistake of selecting too many variables

Summary

With a universe of variables in hand, the goal now is to select the model. Why not include all of the variables?

Big models tend to over-fit and find features that are specific to the data in hand, i.e. not generalizable relationships.

The results are bad predictions and bad science!

In addition, bigger models have more parameters and potentially more uncertainty about everything we are trying to learn.

We need a strategy to build a model in a way that accounts for the trade-off between fitting the data and the uncertainty associated with the model: subset selection, shrinkage, dimension reduction.

Airline data

Monthly passengers in the US airline industry (in thousands of passengers) from 1949 to 1960. Predict the number of passengers in the next couple of months.

Airline data

How about a “trend model”? \(Y_t = \beta_0 + \beta_1 t + \varepsilon_t\)

Airline data

Let’s look at the residuals: are there obvious patterns?

Airline data

The variance of the residuals seems to grow over time. Let’s try to use \(\log(Y_t) = \beta_0 + \beta_1 t + \varepsilon_t\)

Airline data

Let’s look at the residuals: are there obvious patterns?

Airline data

Seasonal patterns? Let’s now add dummy variables for each month (only 11 dummies, with January as the baseline), i.e. \(\log(Y_t) = \beta_0 + \beta_1 t + \beta_2 \text{Feb} + \dots + \beta_{12} \text{Dec} + \varepsilon_t\)
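A sketch of this sequence of fits in R, using the built-in AirPassengers series as a stand-in for the data on the slides:

y     <- as.numeric(AirPassengers)                          # monthly passengers, 1949-1960
t     <- seq_along(y)
month <- factor(cycle(AirPassengers), labels = month.abb)   # January is the baseline

fit_trend  <- lm(y ~ t)                    # trend model: residual variance grows over time
fit_log    <- lm(log(y) ~ t)               # log scale stabilizes the variance
fit_season <- lm(log(y) ~ t + month)       # trend plus 11 monthly dummies

plot(t, resid(fit_season), type = "l")     # residuals still show serial correlation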

Airline data

Still not happy: the residuals do not look like i.i.d. normal noise!

Question time